PROJECT BOSTON

Boston Housing Prices ML Analysis

Introduction

This project examines the key factors influencing Boston home prices using machine learning models (the KNN algorithm, stepwise regression, and random forest). The goal is to understand which factors drive median home values (medv) and to identify the best predictive model. With 13 predictor variables to work with, the number of possible models is large. In this project, we will use a model with all 13 predictors (model 1), a model derived by analyzing correlations and random forest importance (model 2), and a model selected by stepwise regression (model 3). We then measure the predictive accuracy of these models by creating training and test data sets, using the training data to predict the test data, and analyzing the performance results. The 13 predictors are described below.

crim 🚨

per capita crime rate by town.

zn 🏡

proportion of residential land zoned for lots over 25,000 sq.ft.

indus 🏭

proportion of non-retail business acres per town.

chas 🌊

Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox 🌫️

nitrogen oxides concentration (parts per 10 million).

rm 🏠

average number of rooms per dwelling.

age 🏚️

proportion of owner-occupied units built prior to 1940.

dis 🚉

weighted mean of distances to five Boston employment centres.

rad 🛣️

index of accessibility to radial highways.

tax 💰

full-value property-tax rate per $10,000.

ptratio 📚

pupil-teacher ratio by town.

black ⚖️

1000(Bk − 0.63)², where Bk is the proportion of Black residents by town.

lstat 📉

lower status of the population (percent).

Data Inspection

      crim         zn      indus       chas        nox         rm        age 
-0.3883046  0.3604453 -0.4837252  0.1752602 -0.4273208  0.6953599 -0.3769546 
       dis        rad        tax    ptratio      black      lstat       medv 
 0.2499287 -0.3816262 -0.4685359 -0.5077867  0.3334608 -0.7376627  1.0000000 
[1] 22.53281
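
For reference, a minimal R sketch that could reproduce the output above (assuming the Boston data from the MASS package):

library(MASS)           # provides the Boston housing data
data(Boston)
cor(Boston)[, "medv"]   # correlation of every variable with medv
mean(Boston$medv)       # mean medv, reported in $1000s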

The associations between the predictors and medv shown above were derived from a correlation matrix:

From the results above, higher poverty rates are linked to lower property values, as shown by the strongest negative correlation (-0.7377), between medv and lstat (% of lower-status individuals).
The strongest positive correlation with medv is rm (average number of rooms per dwelling) at 0.6954, indicating that larger homes are typically more costly. Other variables with moderately negative correlations with medv, such as nox (pollution) and ptratio (pupil-teacher ratio), suggest that higher pollution and weaker schools pull home prices down. The mean home price (medv) in the dataset is $22,532.81 (about $22.5K); note that these are 1978 prices. Next, we’ll use a random forest to explore which factors matter most for housing prices.

Random Forest

        IncNodePurity
rm         12574.0867
lstat      11898.8508
indus       3063.4200
nox         2747.0916
crim        2645.0800
ptratio     2607.5666
dis         2412.3478
tax         1387.2161
age         1105.2912
black        776.3449
rad          360.6068
chas         237.9042
zn           233.4437
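
A minimal sketch of the R call that could produce an importance table like the one above (assuming the randomForest package; exact values vary from run to run):

library(randomForest)
set.seed(42)                       # hypothetical seed for reproducibility
rf_model <- randomForest(medv ~ ., data = Boston)
importance(rf_model)               # reports IncNodePurity for each predictor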

A decision tree is a flowchart-like model that splits the data into smaller groups according to feature values in order to make predictions. The objective of each split is to maximize homogeneity, or purity, meaning that the target variable’s values within each resulting group are as similar as possible. In a random forest, IncNodePurity (Increase in Node Purity) is a measure of feature importance: it tells us how much a variable helps reduce impurity (variance) across the decision trees.

There are nodes in every tree:

Root Node 🌱 → The initial split, made on the most important feature.
Internal Nodes 🔗 → Intermediate decision points that further divide the data.
Leaf Nodes 🍂 → The final outputs, where predictions are produced.

How it works:

  • Each decision tree in the forest splits the data at nodes based on different features.

  • The purity of each node is measured using Mean Squared Error (MSE) for regression. If splitting on a variable reduces MSE by a large amount, it is a good split (see the small worked sketch after this list).

  • Higher IncNodePurity means the feature is more useful in splitting data and reducing error.
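
To make the MSE-based purity idea concrete, here is a small hypothetical sketch of the impurity decrease from one split (the lstat < 10 threshold is chosen purely for illustration):

y     <- Boston$medv
left  <- y[Boston$lstat < 10]              # tracts with lower poverty
right <- y[Boston$lstat >= 10]             # tracts with higher poverty
sse   <- function(v) sum((v - mean(v))^2)  # sum of squared errors around the group mean
sse(y) - (sse(left) + sse(right))          # decrease in node impurity from this split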

According to the random forest model’s IncNodePurity measure, the most important predictors of medv are lstat (poverty rate) and rm (number of rooms), which is consistent with the correlation findings.
Other important predictors include ptratio (pupil-teacher ratio), nox (pollution), and dis (distance to employment centers), suggesting that environmental and educational factors are important determinants of home values. zn (the proportion of residential land zoned for large lots) and chas (proximity to the Charles River) are less important.

Stepwise Regression

Start:  AIC=1589.64
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1587.7
- indus    1      2.52 11081 1587.8
<none>                 11079 1589.6
- chas     1    218.97 11298 1597.5
- tax      1    242.26 11321 1598.6
- crim     1    243.22 11322 1598.6
- zn       1    257.49 11336 1599.3
- black    1    270.63 11349 1599.8
- rad      1    479.15 11558 1609.1
- nox      1    487.16 11566 1609.4
- ptratio  1   1194.23 12273 1639.4
- dis      1   1232.41 12311 1641.0
- rm       1   1871.32 12950 1666.6
- lstat    1   2410.84 13490 1687.3

Step:  AIC=1587.65
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1585.8
<none>                 11079 1587.7
+ age      1      0.06 11079 1589.6
- chas     1    219.91 11299 1595.6
- tax      1    242.24 11321 1596.6
- crim     1    243.20 11322 1596.6
- zn       1    260.32 11339 1597.4
- black    1    272.26 11351 1597.9
- rad      1    481.09 11560 1607.2
- nox      1    520.87 11600 1608.9
- ptratio  1   1200.23 12279 1637.7
- dis      1   1352.26 12431 1643.9
- rm       1   1959.55 13038 1668.0
- lstat    1   2718.88 13798 1696.7

Step:  AIC=1585.76
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
+ indus    1      2.52 11079 1587.7
+ age      1      0.06 11081 1587.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
    tax + ptratio + black + lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Analysis

Stepwise regression is a feature-selection method for finding the most useful predictors in a regression model. By automating the variable-selection procedure, it helps ensure that the model contains only the most significant predictors.

Statistical significance (p-values) and model performance (AIC/BIC/R²) determine whether variables are added or removed in stepwise regression.

Different Types of Stepwise Regression:
Forward Selection: Start with no variables and add them one at a time if they improve the model.
Backward Elimination: Start with all variables and remove them one at a time, beginning with the least important.
Stepwise Selection: A combination of the two; variables are added if they are helpful and removed if they later prove to be irrelevant.

In this instance, we used stepwise selection, as it is the most flexible of the three and helps prevent overfitting.
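
A minimal sketch of how this selection might be run in R (assuming the full linear model is fit first and base R's step() is used in both directions):

full_model <- lm(medv ~ ., data = Boston)
step_model <- step(full_model, direction = "both")  # prints an AIC trace like the one above
summary(step_model)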

Process:

Step 1: Start with All Variables

The full model includes all 13 predictors. The Akaike Information Criterion (AIC) is 1589.64, which measures model quality (lower = better).

medv = β₀ + β₁·crim + β₂·zn + β₃·indus + β₄·chas + β₅·nox + β₆·rm + β₇·age + β₈·dis + β₉·rad + β₁₀·tax + β₁₁·ptratio + β₁₂·black + β₁₃·lstat + ε

Step 2: Remove Least Important Variable

The stepwise procedure evaluates the contribution of each variable; a variable is removed if dropping it has no discernible effect on the model’s accuracy.

The first variable eliminated was age (proportion of homes built before 1940): its p-value of 0.958 was not statistically significant, it contributed almost nothing to R², and removing it decreased the AIC to 1587.7, indicating a better model fit.

Step 3: Remove Other Unimportant Variables

indus (non-retail land proportion) was removed next because its p-value of 0.738 was too high, and the AIC improved to 1585.8, meaning the model became simpler without losing accuracy.

At this point, with age and indus removed, we are left with the 11 strongest variables. (Some of the remaining predictors are relatively weak, but they were kept because they still improve the model slightly.)

🔹 Final Adjusted R² = 0.7348
🔹 Final AIC = 1585.8

Fitting the Regression Models
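
A minimal sketch of the three lm() fits whose summaries appear below (boston_clean is the data set name shown in the output; it is assumed to be a cleaned copy of the Boston data):

model1 <- lm(medv ~ ., data = boston_clean)            # all 13 predictors
model2 <- lm(medv ~ lstat + rm, data = boston_clean)   # predictors chosen from correlations and random forest
model3 <- lm(medv ~ crim + zn + chas + nox + rm + dis + rad + tax +
               ptratio + black + lstat, data = boston_clean)  # stepwise selection
summary(model1)
summary(model2)
summary(model3)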


Call:
lm(formula = medv ~ ., data = boston_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Call:
lm(formula = medv ~ lstat + rm, data = boston_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.076  -3.516  -1.010   1.909  28.131 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.35827    3.17283  -0.428    0.669    
lstat       -0.64236    0.04373 -14.689   <2e-16 ***
rm           5.09479    0.44447  11.463   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.54 on 503 degrees of freedom
Multiple R-squared:  0.6386,    Adjusted R-squared:  0.6371 
F-statistic: 444.3 on 2 and 503 DF,  p-value: < 2.2e-16

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
    tax + ptratio + black + lstat, data = boston_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Interpretation

Model 1 (all predictors):

  • Adjusted R² = 0.7338 → Explains 73.38% of the variance in home prices.

  • Residual Standard Error (RSE) = 4.745 → The average error in home price predictions is $4,745.

  • F-statistic = 108.1, p < 2.2e-16 → The model is statistically significant, but it includes some statistically insignificant predictors (which model 3 removes).

Model 2 (2 selected predictors: lstat + rm):

  • Adjusted R² = 0.6371 → Explains 63.71% of the variance in home prices.

  • Residual Standard Error = 5.54 → Higher error than Model 1.

  • F-statistic = 444.3, p < 2.2e-16 → The model is significant.

Model 3 (11 automatically selected predictors):

  • Adjusted R² = 0.7348 → Slightly better than Model 1.

  • Residual Standard Error = 4.736 → Lower than Model 1 (better predictive accuracy).

  • F-statistic = 128.2, p < 2.2e-16 → Strong significance.

Model 3 so far has the highest explained variance and lowest error.

Why we can’t conclude model 3 is best based on this output alone:

  • Standard error (SE) and R² (the coefficient of determination) are helpful metrics for assessing how well a model fits the training data, but they don’t reveal how well the model will perform on new, unseen data, which is what matters for prediction. (For example, a highly complex model can have an R² of 90% and a low standard error yet still predict poorly, because it overfits and captures noise instead of true relationships.)

  • For this reason, before deciding on the optimal model, we test how well each set of predictors generalizes to unseen data, using KNN models trained with cross-validation on a training set and evaluated on a held-out test set.

KNN model 1

k-Nearest Neighbors 

306 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 275, 275, 275, 276, 275, 275, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   1  6.985692  0.5060300  4.970925
   2  6.257435  0.5566765  4.377462
   3  6.430624  0.5185585  4.417509
   4  6.223864  0.5479263  4.356895
   5  6.333470  0.5225443  4.444381
   6  6.377989  0.5101954  4.475222
   7  6.448804  0.4947965  4.617307
   8  6.481047  0.4897292  4.609556
   9  6.529151  0.4812203  4.628434
  10  6.637433  0.4644016  4.714632

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 4.
Root Mean Squared Error (RMSE): 6.646287 

KNN model 2

k-Nearest Neighbors 

306 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 275, 275, 275, 276, 275, 275, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   1  5.126110  0.7014945  3.646426
   2  4.545061  0.7457584  3.310407
   3  4.445402  0.7479194  3.146399
   4  4.344117  0.7620179  3.116306
   5  4.226525  0.7706582  3.018815
   6  4.184657  0.7728534  2.972095
   7  4.175576  0.7729026  2.995383
   8  4.190023  0.7709909  3.029928
   9  4.203446  0.7703049  3.037958
  10  4.251191  0.7676662  3.044116

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 7.
Root Mean Squared Error (RMSE): 4.981061 

KNN model 3

k-Nearest Neighbors 

306 samples
 11 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 275, 277, 276, 275, 276, 276, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   1  6.122807  0.5700169  4.267206
   2  5.969561  0.5522745  4.140640
   3  6.265553  0.5036003  4.302384
   4  6.287100  0.5052000  4.362933
   5  6.054120  0.5334745  4.149887
   6  6.335535  0.4856960  4.328062
   7  6.489349  0.4612534  4.466854
   8  6.643476  0.4390953  4.597001
   9  6.682019  0.4358769  4.590271
  10  6.714576  0.4353189  4.622883

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 2.
Root Mean Squared Error (RMSE): 6.202388 

Information on KNN Algorithm:

KNN is an instance-based, non-parametric learning algorithm that can be applied to both regression and classification problems. We use KNN because it does not assume a linear relationship in the data (linear regression assumes linearity, which may not always hold).

KNN in Regression:

  1. Train the model using cross-validation (10-fold in this case) to prevent overfitting and to estimate the model’s performance more reliably.
  2. Tune the k hyperparameter (the number of neighbors used by the KNN algorithm to make a prediction) by testing multiple values of k (here, 1 to 10) and selecting the value with the lowest Root Mean Squared Error (RMSE).
  3. The train() function fits a k-Nearest Neighbors regression model to predict medv (Median Value of Owner-Occupied Homes) from the model’s predictor variables.
  4. It evaluates the performance of the model for each k value in the grid myGrid (e.g., k from 1 to 10).
  5. It uses 10-fold cross-validation to estimate the model’s accuracy and to identify the best k by evaluating performance (e.g., RMSE) for each fold and each k. (A sketch of this workflow follows this list.)
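
A minimal sketch of this workflow in R, assuming the caret package, a 60/40 train/test split (which yields roughly the 306 training samples reported above), and a hypothetical seed:

library(caret)
library(MASS)                                        # Boston data
set.seed(123)                                        # hypothetical seed; results vary by run

train_idx  <- createDataPartition(Boston$medv, p = 0.6, list = FALSE)
train_data <- Boston[train_idx, ]                    # roughly 306 rows
test_data  <- Boston[-train_idx, ]

myGrid <- expand.grid(k = 1:10)                      # candidate k values
ctrl   <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

knn_model1 <- train(medv ~ ., data = train_data, method = "knn",
                    tuneGrid = myGrid, trControl = ctrl)
knn_model1                                           # CV results across k, as shown above

preds <- predict(knn_model1, newdata = test_data)
sqrt(mean((preds - test_data$medv)^2))               # test-set RMSE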

How k works:
To make a prediction, the algorithm considers the k nearest data points, called neighbors. Distance between points is measured with a metric, most commonly Euclidean distance: the closer two points are, the more similar they are considered. In KNN regression, the prediction is the mean of the k nearest neighbors: the model finds the k closest houses and predicts the median home value (medv) as the average price of those neighbors.
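
A tiny hypothetical illustration of the distance calculation on two features (the feature values are made up):

house_a <- c(rm = 6.5, lstat = 5.0)
house_b <- c(rm = 6.0, lstat = 7.5)
sqrt(sum((house_a - house_b)^2))   # Euclidean distance, about 2.55
# A k = 3 prediction would average the medv of the 3 closest houses.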

10-Fold Cross-Validation:

1️⃣ The dataset is randomly split into 10 equal-sized folds (with 306 training samples, each fold contains roughly 30; see the sketch after this list).
2️⃣ The model is trained on 9 folds and tested on the remaining fold.
3️⃣ This process repeats 10 times, each time using a different fold for testing.
4️⃣ The final performance metric (e.g., RMSE) is the average over the 10 test folds.
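
A small sketch of how such folds could be inspected with caret (assuming the train_data object from the earlier sketch):

folds <- createFolds(train_data$medv, k = 10)   # list of 10 held-out index sets
sapply(folds, length)                           # roughly 30 rows per fold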

Why do we need cross-validation in KNN?

  • Helps find the best k for KNN (avoids overfitting/underfitting).

Interpretation of Results

(Results will differ slightly each time the code is run because of the randomness in the train/test split and cross-validation; the numbers below correspond to the run shown above.)

Model 1 (all predictors):

🔹 Final k chosen: k = 4 (lowest cross-validated RMSE, about 6.22)
🔹 R² ≈ 0.55, meaning the model explains about 55% of the variance in house prices.
🔹 Test-set RMSE ≈ 6.65, meaning the average price prediction error is about $6,650.

Model 2 (chosen model: lstat + rm)

🔹 Final k chosen: k = 7 (lowest cross-validated RMSE, about 4.18)
🔹 R² ≈ 0.77, meaning this model explains about 77% of the variance in home prices.
🔹 Test-set RMSE ≈ 4.98, meaning an average price prediction error of about $4,980.

Model 3 (automated model with 11 predictors)

🔹 Final k chosen: k = 2 (lowest cross-validated RMSE, about 5.97)
🔹 R² ≈ 0.55, meaning this model explains about 55% of the price variance.
🔹 Test-set RMSE ≈ 6.20, meaning an average price prediction error of about $6,200.

Conclusion

The best overall model is KNN model 2, which uses rm (number of rooms) and lstat (% lower-status population): test RMSE ≈ 4.98 and R² ≈ 0.77. Because KNN outperforms the linear regression built on the same two predictors (rm + lstat), this suggests that the relationship between home prices (medv) and the predictors is at least somewhat non-linear.

Explaining about 77% of the variance is strong, and a test RMSE of about 4.98 is reasonable: prediction errors are around $4,980 on average.

Although model 3 (11 predictors) had a higher R² and a lower standard error when the linear models were fit, a model that fits the training data too closely (overfitting) often ends up performing worse on new, unseen data.